TO MY PARENTS


ABSTRACT 


The clustered file organization and the search
algorithms associated with it are studied in detail. A
method of estimating the number of records which are
'closest' to a given query in a clustered file organization
is implemented. Since the data base used to test the
estimation does not exactly satisfy the assumption of the
model, the error percentage obtained is higher than
expected, and thus a modification scheme is developed
to improve the accuracy of the estimation.

CHAPTER 1


INTRODUCTION 


1.1 INFORMATION RETRIEVAL SYSTEMS 

Digital computers were originally designed for solving
mathematical problems and the like. However, with constant
public demand and the introduction of direct-access mass
storage and sophisticated operating systems, they were
transformed into a means of processing information. Since
then hundreds of information storage and retrieval systems
have been developed. Some of them are multi-purpose, while
others are aimed at specific functions.

Basically, an information retrieval system is composed
of a data base, a retrieval subsystem and a data base
maintenance subsystem. The retrieval subsystem is used to
analyse the user queries and retrieve the appropriate
information. The data base maintenance subsystem is used to
keep the data base up-to-date by performing such functions
as the addition of new records, the deletion of obsolete
records and the modification of existing records.

In general, according to the organization of the data
base and the way that the information is retrieved,
information retrieval systems can be broadly classified into
two groups: data retrieval systems and document retrieval
systems.

The records of a data retrieval system are usually
composed of a fixed number of fields or attributes. Each
record is completely characterized by the values of its
attributes and is uniquely identified by the value of a key
field or attribute. For example, in a student record
retrieval system, each record may consist of 4 attributes:
student number, name, age and sex. Each record is uniquely
identified by the value of the student number. The retrieved
information has to satisfy exactly the conditions specified
by the user queries. For example, the records retrieved for


the query 


find AGE < 20 AND SEX = female 


represent the group of female students who are under 20
years of age.
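
The exact-match behaviour of a data retrieval system can be illustrated with a minimal Python sketch. The `Student` class, the function name and the sample records are hypothetical and only mirror the four attributes named in the example above; they are not taken from any actual system.

```python
from dataclasses import dataclass


@dataclass
class Student:
    number: int   # key attribute, unique per record
    name: str
    age: int
    sex: str


def find_young_female(records):
    """Return the records satisfying AGE < 20 AND SEX = female exactly."""
    return [r for r in records if r.age < 20 and r.sex == "female"]


# Hypothetical sample records, invented for illustration only.
students = [
    Student(1001, "Ann", 19, "female"),
    Student(1002, "Bob", 18, "male"),
    Student(1003, "Cay", 22, "female"),
]
matches = find_young_female(students)
```

Every record either satisfies the condition or does not; there is no notion of a partial match, which is the main contrast with the document retrieval systems described next.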

Data retrieval systems are widely used by the general
public, especially in the commercial sector, in which
management decision problems as well as routine inventory
problems have to be solved.

Document retrieval systems, on the other hand, deal
with items which may not be completely characterized by the
values of a fixed set of attributes. Examples of such items
are books and pictures. A set of descriptors or keywords is
used instead to represent and describe the content of the items.


The choice of the keywords or descriptors is quite


subjective. Usually the contents of an item are translated 
into a record of keywords by using a manually prepared 
dictionary. User queries are formed in the same way (by 
using the same dictionary). Queries are matched or 
correlated with the records within the data base and the 
records which have a substantial number of keywords in 
common with the queries are retrieved. Note that the 
retrieved records do not have to match the query exactly as 
in the case of the data retrieval system. 

In a data retrieval system, since all the records are 
completely characterized by a fixed number of attributes, 
the retrieved records are always acceptable to the users as 
long as they satisfy the conditions specified by the user 
query. However, this is not the case in a document retrieval
system. There are some factors we have to consider regarding
the relevance of the retrieved information. First, in
forming a query, the user may not be able to choose the most
appropriate keywords; thus the retrieved records may turn
out to be different from those the user is expecting.
Second, the keywords which were chosen by the indexer may
have different meanings to the users. Third, the matching
function used may not accurately measure the closeness
between the query and the records. Thus we could have a
situation in which some of the closest records as measured
by the matching function using the supplied keywords may not 
be considered relevant while some non-retrieved records may 


be of interest to the users.

One approach to alleviating the above problems in an on-line
processing environment is to employ an iterative
process known as relevance feedback. Relevance feedback is
a process in which the users, by responding to the
displayed information, identify the relevant and irrelevant
documents among the retrieved documents and have this
information fed back to the system. The system then alters
the queries by giving heavier weights to the keywords of the
relevant documents and reducing the importance of the
attributes found in the irrelevant documents. A new search
is then made and the result is presented to the user, and the
entire process is repeated until the user is satisfied with
the result. One of the retrieval systems which employs the
relevance feedback technique is the SMART retrieval
system [1].
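
One iteration of this process can be sketched in Python as follows. This is only an illustration of the idea of raising and lowering keyword weights; the function name `feed_back` and the step sizes `gain` and `penalty` are assumptions made here and are not the formula used by the SMART system.

```python
def feed_back(query, relevant, irrelevant, gain=0.5, penalty=0.25):
    """Return a new keyword -> weight query adjusted by user judgements.

    query      : dict mapping keyword -> weight
    relevant   : list of keyword sets judged relevant by the user
    irrelevant : list of keyword sets judged irrelevant by the user
    gain, penalty : hypothetical step sizes, not taken from the thesis
    """
    new_query = dict(query)
    for doc in relevant:
        for kw in doc:
            new_query[kw] = new_query.get(kw, 0.0) + gain
    for doc in irrelevant:
        for kw in doc:
            new_query[kw] = new_query.get(kw, 0.0) - penalty
    # Keywords whose weight falls to zero or below are dropped.
    return {kw: w for kw, w in new_query.items() if w > 0}


q = {"cluster": 1.0, "file": 1.0}
q2 = feed_back(q,
               relevant=[{"cluster", "profile"}],
               irrelevant=[{"file", "chain"}])
```

The altered query `q2` would then be submitted for a new search, and the loop repeated until the user is satisfied.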

We can see that an on-line processing environment
is quite suitable for a document retrieval system and that it
is useful to investigate some of the aspects involved. In
particular, one important aspect upon which the success of an
on-line processing environment might depend is the file
organization. In the next section, several file
organizations which may be suitable for on-line processing 


are discussed.

1.2 FILE ORGANIZATION

The different objectives of different information
systems demand the development of various data or file
organizations. Each type of objective may call for a
particular file organization which will better satisfy the
requirements. In some cases, such as in an inquiry system,
data must be retrieved rapidly while updating of existing
data can proceed at a slower rate. In other cases, such as
an order entry system, it is necessary to store rapidly a
large volume of data which are to be retrieved at a slower
pace.

Since this thesis is interested in document retrieval 
systems and in particular the file organizations, it is 
important at this point to examine several file 
organizations which support document retrieval systems. In 
all, four types of file organization in a document retrieval
system environment are discussed: sequential, chained, 


inverted and clustered. 


A. Sequential Files 
Records or documents in a sequential file are stored in
the order of acquisition and the ith record is retrieved by
a sequential scan of the previous i-1 records. The access
time in a sequential file is so large that this kind of file
organization is normally not suitable for on-line retrieval.
Furthermore, the sequential scheme also presents problems in
the updating and deleting of records. In particular, after a
deletion, unless the file is reorganized, the unused space
once occupied by the deleted record is rather difficult to
recover or reuse.

In spite of the above disadvantages, sequential files
have some favorable properties. Since the entire record can
be examined directly, any information about the documents
can be retrieved at once. Furthermore the overhead is low,
since no pointers or directory are involved. The sequential
file organization is suitable for a data acquisition system
or a system which does not involve too much retrieval but
has to produce many reports from the existing records.


B. Chained Files

Documents or records having common terms or keywords
are grouped together to form a set. The documents in each
set are chained together by using pointers. A file directory
is used to facilitate the searching of chain heads. The file
organization is depicted in figure 1.1. The head of each
chain can be located by using binary search, hashing or some
other table look-up technique. Subsequently, elements of
all the chains are fetched and correlated with the query.
The ones with 'high' correlations are retrieved.


Owing to the structure of the chained file, a number of
pre-search statistics are available. For example, the total
length of all the chains to be retrieved can be used as an
upper bound on the number of retrievable documents.
Furthermore, the length of a particular chain may also be of
interest to some users.

Despite the features mentioned above, there are some
drawbacks in using the chained file. First, a great deal of
work has to be done in order to update or delete a document.
Second, the number of documents with many terms in common
with the query is usually small compared to the total
number of documents accessed. Third, the heavy storage
overhead for the directory and pointers may make the system
too costly to implement. Finally, the merging of the actual
documents from each chain increases the retrieval time to
a great extent.

C. Inverted Files

The design of an inverted file is essentially the same
as that of a chained file; the only difference is that every
index term of the directory in an inverted file is
associated with a set of pointers which is known as the
accession list. Figure 1.2 depicts the organization of an
inverted file. For any given query, the search begins by
scanning the directory to obtain the accession list for each
query term. All the fetched lists are then merged to form a
single list. Documents are then obtained by using the
pointers in the merged list and correlated with the given
user query. Documents with 'high' correlations are then
retrieved. During the process of list merging, the inverted
file scheme eliminates many duplicate documents; thus fewer
documents are accessed from the secondary memory, which
represents a substantial saving in retrieval time.
However, the inverted file scheme also introduces a
tremendous amount of storage overhead.
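
The search procedure just described can be sketched in Python as below. The in-memory dictionaries stand in for the directory and the secondary storage, and the names `build_inverted_file`, `search` and `threshold` are illustrative assumptions rather than part of any system discussed here.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each index term to its accession list (sorted document ids)."""
    directory = defaultdict(list)
    for doc_id in sorted(documents):
        for term in documents[doc_id]:
            directory[term].append(doc_id)
    return directory


def search(directory, documents, query, threshold):
    """Merge the accession lists of the query terms, then correlate."""
    merged = set()
    for term in query:
        merged.update(directory.get(term, []))   # duplicates vanish here
    hits = []
    for doc_id in sorted(merged):                 # one access per document
        correlation = len(documents[doc_id] & query)
        if correlation >= threshold:
            hits.append((doc_id, correlation))
    return hits


# Hypothetical documents, each represented by its set of index terms.
docs = {
    1: {"file", "cluster"},
    2: {"file", "chain"},
    3: {"query", "cluster", "file"},
    4: {"storage"},
}
directory = build_inverted_file(docs)
result = search(directory, docs, {"file", "cluster"}, threshold=2)
```

Because the lists are merged before any document is fetched, document 1 is accessed once even though it appears in the accession lists of both query terms.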
D. Clustered Files

All documents in the file are divided into subsets or
clusters by using a certain classification procedure such
that related documents are placed in the same cluster. Some
examples of classification procedures are due to Needham,
Doyle, Rhodes, Rocchio and others [3,4,5,6].

The contents of the documents in each group or cluster
are then summarized in a profile or centroid. Some profile
definitions are described in Murray [2]. A hierarchy of
profiles can be built if the number of profiles is large.
The user query is correlated with the profiles and the
clusters which have a large number of related documents are
retrieved. Figure 1.3 depicts the organization of a cluster
file with 2 levels of profiles.

There is always a trade-off between cost and
performance in cluster files. More 'close' documents will be
retrieved if more clusters are searched, but the cost will
be higher. One major disadvantage of using cluster files is
the expense of the initial classification procedure
(clustering).

The maintenance of a cluster file is also a problem.
Initially new documents can be inserted into the appropriate
clusters without changes to the profiles. However, after a
period of time, the quality of the profiles diminishes
because the profiles try to represent too much information.
Eventually the whole file organization deteriorates and a
reclassification has to be done.

Since computer systems retrieve one block of data at a
time from the secondary storage and each node in the file
(e.g. a profile or cluster) is usually packed in the same
block, the access time is proportional to the total number
of nodes searched. Furthermore, since actual documents are
available to be examined directly, relevance feedback can
easily be implemented.

In summary, we have described four types of file
organization, each with a different design philosophy.
Since speed, storage requirement and flexibility play an
important role in the success of a document retrieval
system, the four organizations are compared in these
three aspects as follows. First, in terms of speed,
it is quite obvious that a sequential file is not capable of
giving a fast response to the user query, since a sequential
scan of the data file is required to locate the desired
record. For chained files, in order to obtain the desired
documents, a scan of the directory is required to obtain a
list of documents for each query term, followed by a merging
of all the documents. Therefore the access time is
proportional to the number of documents in all the lists
associated with the query terms. The inverted file is
designed to have a faster access time than the chained file,
and its retrieval time is proportional to the number of
query terms and the number of documents retrieved. Note that
the number of documents retrieved is usually a small
fraction of the number of documents examined by a chained
file organization. A clustered file can also provide a
relatively fast response time to the user query as compared
to that of the sequential and chained files, since it only
involves the retrieval of the appropriate profiles and the
corresponding clusters. Furthermore, a clustered file is
likely to have a faster retrieval time than the inverted
file when the number of query terms increases, since its
retrieval time is relatively independent of the number of
query terms.

In terms of storage requirement, the amount of storage
required by a sequential file is minimal, since no pointers,
directory or other overhead are involved. In the case of a
chained file and especially an inverted file, since a large
directory has to be maintained, the storage overhead
involved could make the file size rather large as well. In
the cluster scheme, the overhead is the space used by the
profiles. It is difficult to state the total storage
requirement accurately since it depends on the number of
hierarchy levels used. However, according to Murray [2], the
performance level can still be maintained even when the
profiles are reduced to less than 10% of the space used by
the documents.

Finally, in terms of flexibility, inverted and chained
files can supply some pre-search statistics, for example, the
number of documents containing a number of specified index
terms. The clustered scheme cannot provide this same data,
but offers the possibility of interrupting a search and
viewing checkpoint information.

From the above comparisons, we can conclude that both
inverted and clustered schemes are suitable for on-line
processing. However, in this thesis, we will concentrate on
the retrieval problems of the cluster file organization.

1.3 RETRIEVAL IN CLUSTER FILES

In some cluster file retrieval systems, the profiles
are generally much smaller than the clusters. Therefore they
can be kept in the main memory for fast access while the
clusters are stored in the secondary memory.

In large data sets, the number of clusters and profiles
involved could be large; thus it may be advantageous to group
'similar' profiles together to form a hierarchy of profiles.
However, for the sake of this presentation, we will assume
one level of profiles, though the same approach can be
applied to multi-level profiles.

In practice, the user specifies a query and N, the
number of records to be retrieved. The retrieval system then
correlates the query with the profiles, and the clusters
whose profiles have high correlations with the query are
retrieved. The query is then correlated with the records in
the retrieved clusters and the records are ranked according
to the correlations. The N records with the highest
correlations are then retrieved.
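
The retrieval steps above can be sketched in Python with one level of profiles. The inner product is used here as the correlation measure purely for illustration; the thesis goes on to examine a different, probabilistic measure. The function names and the parameter `n_clusters` are assumptions introduced for this sketch.

```python
def dot(x, y):
    """Correlation used in this sketch: inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def retrieve(query, clusters, profiles, n_records, n_clusters=1):
    """Return the n_records best records from the n_clusters best clusters.

    clusters   : list of clusters, each a list of binary record vectors
    profiles   : one profile vector per cluster (kept in main memory)
    n_clusters : hypothetical parameter, how many clusters to fetch
    """
    # Step 1: rank the clusters by the correlation of their profiles.
    order = sorted(range(len(clusters)),
                   key=lambda c: dot(query, profiles[c]), reverse=True)
    # Step 2: fetch the chosen clusters and rank their records.
    candidates = [rec for c in order[:n_clusters] for rec in clusters[c]]
    candidates.sort(key=lambda rec: dot(query, rec), reverse=True)
    # Step 3: keep the N records with the highest correlations.
    return candidates[:n_records]


# Two hypothetical clusters over four attributes.
clusters = [[(1, 1, 0, 0), (1, 0, 0, 0)],
            [(0, 0, 1, 1), (0, 1, 1, 1)]]
profiles = [(2, 1, 0, 0), (0, 1, 2, 2)]
top = retrieve((0, 1, 1, 0), clusters, profiles, n_records=1)
```

Only the records of the selected clusters are ever compared with the query, which is where the saving over a sequential scan comes from.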

There are a variety of ways to correlate queries with
profiles. This thesis essentially examines the method
proposed by C. T. Yu and W. S. Luk [10].

Their method is based on a probabilistic model in
which a prediction of the number of documents in a given range
of 'accuracy' is made for every cluster. The definition of
accuracy, according to their model, is the total number of
attributes in common between a query and a document. Thus
the clusters can be ranked according to the number of
documents at a certain accuracy level. The clustering model
is introduced in the next chapter.

CHAPTER 2

CLUSTER FILE RETRIEVAL SYSTEMS

2.1 A PROBABILISTIC MODEL


Most users of cluster file document retrieval systems
are interested in obtaining a certain number of documents
which are closest to their queries. For example, a cluster
file information retrieval system for a real estate company
can be designed to provide a list of houses which match
as many customer-specified features as possible. Most
cluster file document retrieval systems generally do not
provide the users, before the retrieval of the actual
records, with information such as estimates of the number of
records which are reasonably close to the user queries and
the clusters where these records can be located. However, if
this kind of information is available, then the users are
able to adjust the query to obtain an optimum amount of
output. For example, if there are too many records that are
'close' to the query, the user can reduce the number of
retrieved records by imposing more constraints on the query.

This thesis investigates the applicability and possible
improvements of a probabilistic model which is constructed
according to the above-mentioned facts. This model includes
all common features of a cluster file information system and,
in addition, gives estimates of the number of
documents in each cluster at a certain degree of 'closeness'
to a given user query. Suppose there are N attributes
(a1,...,aN) in the entire data base and every record or
document is defined to be an N-dimensional binary vector.
The queries to the system are also defined in the same way.
A '1' in the ith position or co-ordinate of a document or
query indicates the presence of the ith attribute and a '0'
indicates otherwise. The profile P of each cluster is
defined to be the vector sum of all the records in the
cluster. Thus if P=(p1,...,pN), then pi is the number of
records in the cluster having the ith attribute. Furthermore,
the closeness between a query X=(x1,...,xN) and a document
Y=(y1,...,yN) is measured by the number of terms or keywords
in common between them. Formally, their closeness can be
measured by the simple matching function

$$\sum_{i=1}^{N} x_i y_i$$
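
The two definitions above, the profile as a vector sum and the simple matching function, translate directly into Python. The function names and the small sample cluster are hypothetical and serve only as an illustration.

```python
def profile(records):
    """Vector sum of the binary record vectors of one cluster."""
    return [sum(column) for column in zip(*records)]


def closeness(x, y):
    """Simple matching function: number of attributes in common."""
    return sum(xi * yi for xi, yi in zip(x, y))


# A hypothetical cluster of M = 3 records over N = 4 attributes.
cluster = [(1, 0, 1, 1), (1, 1, 0, 0), (0, 0, 1, 0)]
p = profile(cluster)
c = closeness((1, 0, 1, 0), cluster[0])
```

Here `p[i]` counts the records of the cluster that contain attribute i, so `p[i] / len(cluster)` is the relative frequency used in the derivation that follows.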

The analysis of this model is based on the following
assumption:

(i) the attributes in the data base are mutually
independent.

With the above assumption, the expected number of
records in a cluster of M records having exactly i
attributes in common with a given query Q is derived as
follows:

Let Q be a query which contains exactly L attributes.
Without loss of generality, let Q contain the first L
attributes in the data base.

Let the profile of the cluster be P=(p1,...,pN). The
probability that a randomly chosen record from the cluster
contains attribute ai, 1<=i<=L, is pi/M. On the other hand,
the probability that a record does not contain ai is
(1-pi/M). Thus by the independence assumption, the probability
that a record contains attributes a1,...,ai but not
a(i+1),...,aL is

$$\prod_{k=1}^{i} \frac{p_k}{M} \prod_{k=i+1}^{L} \left(1 - \frac{p_k}{M}\right)$$


Since there are $\binom{L}{i}$ different ways of choosing i
attributes out of L, the probability that a randomly chosen
record has exactly i attributes in common with Q is equal
to the sum of the corresponding products over all such
choices. We denote this
expression by C(a1,...,aL,i) and call it the C-function.

attributes in common with the query is

By using the expressions that are derived here, it is

records can be selected for retrieval.

Even though the relevance of a retrieved document is
not handled here, it is likely that the more attributes
a retrieved document has in common with a given query, the
higher the chance that the document is relevant to the query.
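
The probability derived in this section can be evaluated by brute force, enumerating every way of choosing which i query attributes are present. The Python sketch below does exactly that under the independence assumption; the function names are illustrative, and the sample profile is hypothetical.

```python
from itertools import combinations
from math import prod


def c_function(profile, m, query_attrs, i):
    """Probability that a random record of the cluster has exactly i of
    the query attributes, assuming the attributes are independent.

    profile     : profile[k] = number of records having attribute k
    m           : number of records in the cluster (M in the text)
    query_attrs : indices of the L attributes contained in the query
    """
    f = [profile[k] / m for k in query_attrs]
    total = 0.0
    for present in combinations(range(len(f)), i):
        chosen = set(present)
        total += prod(f[k] if k in chosen else 1.0 - f[k]
                      for k in range(len(f)))
    return total


def expected_count(profile, m, query_attrs, i):
    """Expected number of records with exactly i attributes in common."""
    return m * c_function(profile, m, query_attrs, i)


# Hypothetical cluster of M = 4 records; the query holds attributes 0, 1, 2.
p = [2, 1, 2, 1]
probs = [c_function(p, 4, [0, 1, 2], i) for i in range(4)]
```

The loop visits every subset of size i, so evaluating all values of i costs on the order of 2^L products. This is the cost that motivates looking for a more efficient way of computing the same numbers.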
2.2 TO COMPUTE THE C-FUNCTION

According to the definition, the C-function is not easy
to compute. Since it involves a high-order polynomial or even
exponential number of combinations of the frequencies of the
attributes (as demonstrated in section 4.2), the time spent
in computing the function for a given cluster may exceed
that of retrieving all the records in the cluster. In this
thesis, a more efficient method is used and is outlined as
follows:
are the corresponding relative frequencies of occurrence in

then
ai is the probability that a document in a cluster has
exactly i attributes in common with the query Q.

Proof

It is trivial for the case of a0 and aL. Since
a0=(1-f1)...(1-fL), a0 is the probability that a document does not
contain q1,...,qL. Similarly the same reasoning can be
applied to aL.


Consider the case of ai. 
= C(q1,...,qL,i)

= the probability that a record has exactly i attributes
in common with a given query,

where {1,...,L} is the disjoint union of
{φ(1),...,φ(i)} and {φ(i+1),...,φ(L)}.

For the remainder of this thesis F is referred to as the 
generating function of a query Q. The following lemma is an 


extension of Lemma 2.1. 


Lemma 2.2 

Let Q1 and Q2 be two disjoint queries whose generating
functions are G1 and G2 respectively. If the terms in Q1 are
independent of those in Q2, then G = G1 * G2 is the
generating function for Q1 U Q2, i.e. the probability that a
document in a cluster has exactly K terms in common with Q1
U Q2 is the coefficient of x^K in G.
Since a_i is the probability that i terms in Q1 co-occur,
b_j is the probability that j terms in Q2 co-occur, and
the terms in Q1 and those in Q2 are independent,

$$\sum_{\substack{0 \le i,\, j \le K \\ i+j=K}} a_i b_j$$

is the probability that i+j=K terms in Q1 U Q2 co-occur.

Thus G is the generating function of Q1 U Q2.
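
The generating function approach can be sketched in Python as a repeated polynomial multiplication. The sketch assumes that the generating function of a query is the product of the factors ((1 - fk) + fk x) over its attributes, which is consistent with a0 = (1-f1)...(1-fL) in the proof above; the function names and sample frequencies are hypothetical.

```python
def multiply(g1, g2):
    """Product of two polynomials given as coefficient lists
    (list index = power of x)."""
    out = [0.0] * (len(g1) + len(g2) - 1)
    for i, a in enumerate(g1):
        for j, b in enumerate(g2):
            out[i + j] += a * b
    return out


def generating_function(freqs):
    """Coefficients of the product of ((1 - f) + f*x) over the frequencies.

    Under the independence assumption, coefficient i is the probability
    that a record has exactly i of the query attributes.
    """
    g = [1.0]
    for f in freqs:
        g = multiply(g, [1.0 - f, f])
    return g


g1 = generating_function([0.5, 0.25])   # query Q1
g2 = generating_function([0.5])         # disjoint query Q2
g = multiply(g1, g2)                    # Lemma 2.2: Q1 U Q2
```

Each multiplication by a linear factor costs time proportional to the current degree, so all L+1 coefficients are obtained in roughly L^2 steps instead of enumerating the subsets of the query attributes.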
2.3 EXPERIMENT ON A REAL DATA SET

In the previous section, we have derived estimates of
the number of records in a cluster having K or more
attributes in common with a given query (the C-function) and
the techniques to compute the estimates. In this section, an
experiment is set up for applying the retrieval concept to
an actual data set and comparing the results obtained with
the theoretical estimates. The procedure for carrying out
the experiment is outlined as follows:


(i) The CRN1400 collection is selected from the SMART
RETRIEVAL SYSTEM [1] as the testing data set. The collection
consists of 1400 documents and 225 queries.

(ii) Subroutine CLUSTER, which uses the clustering algorithm
designed by J. J. Rocchio [6], is selected from the SMART
RETRIEVAL SYSTEM to perform the clustering on the documents
of the CRN1400 collection. Profiles are then created from the
resulting 16 clusters.


(iii) The C-function of each query with respect to each
cluster is obtained. Documents are retrieved from the
cluster whose C-function value is the highest.

(iv) Let Oki and Tki be the actual and expected numbers of
documents with K or more attributes in common with the ith
query. (In the remaining part of this thesis, K-value
refers to K or more attributes in common with a given
query.)

Since Tki is a real number and Oki is an integer, the
estimation does not seem to be realistic enough. In order to
make the estimation more meaningful, integer upper and
lower bounds are used. Thus if Oki is within the two bounds
then the estimation error is zero. The upper and lower bounds
used in this thesis are $\lceil T_{ki} \rceil$ and $\lfloor T_{ki} \rfloor$,

where $\lfloor T_{ki} \rfloor$ is the greatest integer smaller than or equal
to Tki and $\lceil T_{ki} \rceil$ is the smallest integer bigger than or equal
to Tki.

The percentage error is then defined to be

The average error percentage over M1 queries on the
estimation of the number of documents with K or more
attributes in common with the queries is defined to be
The result of the experiment is shown in Table 1.
Queries of approximately the same length are grouped
together and the average error percentage at each K-value
over all queries in a group is obtained. Even though in some
cases the average error percentage is quite low (less than
10%), there are cases where the average error percentage is
quite high. Hence an attempt is made to locate the source of
the errors. In the next chapter, an error analysis is made,
the C-function is modified and better results are obtained.
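
The error measure with integer bounds can be sketched in Python as below. The exact formulas of the thesis are not reproduced in this text, so the definition used here is an assumption: the error is zero inside the bounds, and otherwise it is the distance from the actual count to the nearest bound, taken relative to the actual count. This reading agrees with the example given in Chapter 3, where a prediction of 1 against an actual value of 2 is an error of 50%.

```python
import math


def percentage_error(observed, expected):
    """Error of one estimate, in percent (assumed definition)."""
    lower, upper = math.floor(expected), math.ceil(expected)
    if lower <= observed <= upper:
        return 0.0                      # inside the integer bounds
    nearest = lower if observed < lower else upper
    # Assumption: normalise by the actual count (or 1 when it is zero).
    return 100.0 * abs(observed - nearest) / max(observed, 1)


def average_error(pairs):
    """Mean percentage error over (observed, expected) pairs,
    one pair per query."""
    return sum(percentage_error(o, t) for o, t in pairs) / len(pairs)


single = percentage_error(2, 1.0)
mean = average_error([(2, 1.0), (3, 2.4)])
```

With this definition a small absolute difference can still produce a large percentage when the counts themselves are small.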




Table 1. Experimental results using the CRN1400 collection
(225 queries and 1400 documents)

no query modification

k = the minimal number of attributes which a
record possesses in common with a query

length of queries = 3 - 6

no. of queries    ave err
55                13.44
51                21.78
34                [illegible]
13                [illegible]
3                 0.00

length of queries = 7 - 8

no. of queries    ave err
56                8.56
50                33.68

length of queries = 9 - 10

3
CHAPTER 3 


ERROR ANALYSIS AND QUERY MODIFICATION 


The results obtained for different queries from CRN1400
show that the theoretical values are in most cases below
the actual values at high K-values (see Section 2.3 for the
definition). This simply means that some attributes co-occur
more often than is assumed. In fact, the occurrences of the
attributes of a given query are usually not mutually
independent.

The other source of error is the low
estimated value at high K-values. Suppose the prediction, i.e.
the C-function value, is 1 and the actual value is 2; the
error is already 50%, even though the difference between
them is only 1.

Although the error caused by the latter source is quite
high in some cases, we make no attempt to correct it, since
the cause of the error is quite natural and is unrelated to
the model of this thesis. However, an attempt is made to
correct the former source by relaxing the attribute
independence assumption, and an analysis is outlined in the
next section.
3.2 ASSUMPTION RELAXATION

In an actual data base, it is usually not true that all
the attributes are independent; in fact some of the
attributes almost always occur together in the same
documents, which is a sign of attribute dependency. The
C-function is derived by assuming the independence of
attributes, and violation of this assumption leads to
inaccuracy of the C-function. One way to remedy this
defect is as follows: divide the attributes of a given query
Q into two sets, S1 and S2, where set S1 contains
attributes from Q which are dependent on each other and set
S2 contains attributes from Q which are independent of each
other and independent of the attributes in S1. The
theoretical generating functions of S1 and S2 with respect to
each cluster are then obtained. According to Lemma 2.2, the
theoretical generating function for Q can be expressed as a
product of 2 polynomials.
the big error percentage, since the attributes do not
satisfy the independence assumption of the model. Thus we can
expect to improve the accuracy of F, if f1 is somehow
modified. One way to modify f1 is by replacing it with the
actual co-occurrence polynomial $f_2 = a_0 + a_1 x + \cdots + a_r x^r$. Note
that the actual co-occurrence polynomial of a query is a
polynomial whose coefficient ai represents the relative
frequency of documents in a given cluster with
exactly i attributes in common with the given query. Thus
after replacing f1 by f2, F now becomes

$$F = (a_0 + a_1 x + \cdots + a_r x^r)(c_0 + c_1 x + \cdots + c_s x^s)$$

where r and s are the numbers of attributes in S1 and S2
respectively.
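
The query splitting idea can be sketched in Python as follows. The actual co-occurrence polynomial of the dependent set S1 is counted directly from the records of a cluster, the theoretical polynomial of the independent set S2 is built from relative frequencies, and the two are multiplied. All function names and the sample cluster are hypothetical; the sketch assumes the split into S1 and S2 is already known.

```python
def multiply(g1, g2):
    """Product of two polynomials given as coefficient lists."""
    out = [0.0] * (len(g1) + len(g2) - 1)
    for i, a in enumerate(g1):
        for j, b in enumerate(g2):
            out[i + j] += a * b
    return out


def theoretical_polynomial(freqs):
    """Product of ((1 - f) + f*x), valid for independent attributes."""
    g = [1.0]
    for f in freqs:
        g = multiply(g, [1.0 - f, f])
    return g


def actual_polynomial(records, attrs):
    """Coefficient i = relative frequency of records in the cluster
    having exactly i of the attributes in attrs."""
    counts = [0] * (len(attrs) + 1)
    for rec in records:
        counts[sum(rec[a] for a in attrs)] += 1
    return [c / len(records) for c in counts]


def split_query_polynomial(records, s1, s2):
    """F = (actual polynomial of S1) * (theoretical polynomial of S2)."""
    m = len(records)
    freqs = [sum(rec[a] for rec in records) / m for a in s2]
    return multiply(actual_polynomial(records, s1),
                    theoretical_polynomial(freqs))


# Hypothetical cluster: attributes 0 and 1 always occur together,
# attribute 2 is independent of them.
records = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1)]
f_split = split_query_polynomial(records, s1=[0, 1], s2=[2])
f_naive = theoretical_polynomial([0.5, 0.5, 0.5])
```

In this small example `f_split` reproduces the true distribution of the cluster, while `f_naive`, which treats all three attributes as independent, underestimates the fraction of records matching all three query attributes.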

In order to find out whether the query splitting
concept can really improve the retrieval result, an
experiment based on this concept is devised. One
problem in conducting this experiment is to determine
the dependency of the attributes in each cluster.
Unfortunately, this is a very expensive process, since all
the co-occurrences of i different attributes for 1 <= i <= N
and all possible combinations in each case have to be
considered. In fact the time needed to complete the
process is O(2^n), where n is the total number of
attributes in the data base.

Assuming the data base has been in use for some time so
that a set of representative user queries has been obtained,
the set of queries can then be used to determine the dependency
relation of the most frequently occurring attributes (to be
described below). The results will then be used to assist in
the retrieval of documents by future queries. In this
thesis, an experiment is set up to determine the dependency
relation of the most frequently occurring attributes and it
is described as follows:

The set of user queries is divided into two sets: set A
and set B, such that the attributes in any query in set B
are a subset of the attributes of the queries in set A. Set
A is called the training set and set B is called the testing
set. The training set is used as a source for finding all
the dependent attributes with respect to each cluster of the
data base. The results are used to modify the generating
functions of the queries in the testing set by using the
query splitting concept. The same retrieval procedure as
described in section 2.3 is then applied to the queries in
the testing set, with the aim of obtaining a more accurate
prediction.

In a realistic retrieval environment, the data base is 
assumed to have been used for a reasonably long period of 
time. Thus representative queries can be collected. This set 
of queries can be used as set A in the above description. 
Recent and future queries then correspond to set B. 
Therefore, in practice, the strategy described above can be
implemented.
3.3 THE DEPENDENT ATTRIBUTES

In this thesis, an algorithm which uses the
contingency table (APPENDIX 1) is developed to locate the
dependent attributes in a given query. The details of the
algorithm are given as follows:


Definition: Let $F_1 = c_0 + c_1 x + \cdots + c_n x^n$ be the
generating function of a given query with respect to a
cluster and let $F_2 = a_0 + a_1 x + \cdots + a_n x^n$ be the
actual relative frequencies of co-occurrence function with
respect to the same cluster. The error of F1 is then defined
from the coefficients of F1 and F2 (the defining formula is
illegible in the source).

Let Q=(q1,...,qL) be the given query, E an error
tolerance and C the cluster from which the query
retrieves documents. The error of the generating function of
Q is first obtained. If the error is less than or equal to E
then Q is assumed to have no dependent attributes and the
algorithm stops. If the error is greater than E, then all
possible combinations of two attributes are enumerated and
for each pair of attributes, their chi-square value is
obtained from the contingency table. The pairs of attributes
are then ranked in descending order of their chi-square
value. Since the higher the chi-square value, the higher is
the probability that a pair of attributes are dependent
(APPENDIX 2), it is reasonable to assume that the highest
ranked attribute pair is more dependent than the next
highest ranked pair and so on. Thus the generating function
of Q is modified to be F = F1 * Fr, where F1 is the actual
co-occurrence function of the highest ranked attribute pair
and Fr is the theoretical generating function of the
C-function of the rest of the attributes. The error of F is
then computed. If it is smaller than E then it may be
asserted that there is only one pair of dependent attributes
in Q. However, if the error of F is still greater than E
then F is modified further into F = F2 * Fr, where F2
includes the actual co-occurrence function for the first and
the next highest ranked attribute pair and Fr is the
theoretical generating function for the rest of the
attributes. This procedure is repeated until the error is
less than E. The details of this algorithm are given in
APPENDIX 3.
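
The ranking step can be sketched as follows. The sketch assumes that documents are sets of attributes and uses the standard Pearson chi-square statistic for a 2 x 2 contingency table (without continuity correction); the thesis does not show its exact formula in this section, so this choice and the function names are assumptions.

```python
from itertools import combinations


def chi_square_2x2(x11, x12, x21, x22):
    """Pearson chi-square for a 2 x 2 contingency table (assumed form).

    x11: both attributes present, x12 / x21: exactly one present,
    x22: neither present.
    """
    n = x11 + x12 + x21 + x22
    denom = (x11 + x12) * (x21 + x22) * (x11 + x21) * (x12 + x22)
    if denom == 0:
        # A degenerate table (an attribute in all or no documents)
        # carries no evidence of dependence.
        return 0.0
    return n * (x11 * x22 - x12 * x21) ** 2 / denom


def rank_pairs(docs, query):
    """Return (chi-square, pair) tuples in descending chi-square order."""
    scored = []
    for a, b in combinations(sorted(query), 2):
        x11 = sum(1 for d in docs if a in d and b in d)
        x12 = sum(1 for d in docs if b in d and a not in d)
        x21 = sum(1 for d in docs if a in d and b not in d)
        x22 = len(docs) - x11 - x12 - x21
        scored.append((chi_square_2x2(x11, x12, x21, x22), (a, b)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


docs = [{"a", "b"}, {"a", "b"}, {"c"}, set()]
ranked = rank_pairs(docs, {"a", "b", "c"})
```

Here the pair ("a", "b"), which always co-occurs, receives the highest chi-square value and is therefore the first candidate for the actual co-occurrence function.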
3.4 EXPERIMENT SETUP AND RESULTS

The experimental environment remains the same as that
described in section 2.3. Instead of using all 225 queries
in CRN1400 for retrieving documents, they are divided into
two sets, set A (155 queries) and set B (70 queries) (see
Section 3.2 for the definition). Essentially set A is used to
obtain dependent attributes with respect to each cluster in
the data base. The dependent attributes are then used by set
B to modify the C-functions of its queries for the retrieval
of documents.

The dependent attribute set Di, with respect to cluster
i, is obtained by applying algorithm 3.1 to each query in
set A. The attributes of each query in B with respect to
each cluster are then divided into two sets: the dependent
and the independent attribute set. The dependent set in a
given query in B with respect to cluster i is obtained from
the intersection of the query and Di. The independent set
consists of the remaining attributes of the query. The
generating function for each query with respect to cluster i
is then of the form F = F1 * F2, where F1 is the actual
frequencies of co-occurrence function of the dependent
attributes with respect to cluster i and F2 is the
generating function for the independent attributes.
Documents are then retrieved from the clustered data base
(as described in section 2.3). The results shown in TABLE
2 (with an average error of 11.06%) indicate an improvement
over the original results (with an average error of 13.12%).
The following is a diagram showing the performance of a
typical query.


DIAGRAM 3.1
THE PERFORMANCE OF A TYPICAL QUERY IN SET B
BY USING THE C-FUNCTION AND THE MODIFIED C-FUNCTION
(axes: NUMBER OF DOCUMENTS against NUMBER OF TERMS MATCHED
WITH THE QUERY; curves: a - actual, m - modified C-function,
c - C-function)
TABLE 2
Experimental results by using set B queries only.
The C-function was modified by using the
dependent term set Di.
k = the minimal number of attributes which a
    record possesses in common with a query


length of queries = 3 - 5

no. of queries    ave err
      19            4.94
      16            5.46
      10           25.00
       5            0.00


length of queries = 6 - 7

no. of queries    ave err
      17            3.45
      17            6.91
      17            8.18
      10           10.00
       7            7.14


length of queries = 8 - 9

no. of queries    ave err
      22            4.07
      22            (illegible)
      22            (illegible)
      20           18.16
From the performance of the queries in set B, the
following observations are made:

(i) When the number of dependent attributes in a query is
less than L/2 then the error increases approximately at k >
L/2, where L is the length of a query.

(ii) In most cases the modified C-function
underestimates the true value at k > L/2, implying there
are still signs of dependency.


It was believed that more improvement could be achieved
by modifying the C-function further in the light of the above
observations. The observations suggest that the independent
attributes in a query are also dependent to a certain
extent, even though their dependency is not as strong as
that of the dependent attributes. Thus by using the error
made by the dependent attributes as an upper bound (the
difference between the C-function and the modified
C-function), we can modify the C-function further by shifting
the modified C-function value towards the actual value. The
following is the modification scheme used in this thesis.


Let the query be Q=(q1,...,qL) and the corresponding
frequencies be (p1,...,pL).

(i) The modified C-function is modified further at k > L/2.
Denote the difference between the modified C-function and
the C-function at k > L/2 by $MC_{L/2}$.

(ii) Without loss of generality, let D1=(q1,...,qi) be the
dependent attributes of Q and D2=(q(i+1),...,qL) be the
remaining set.

(iii) If i >= L/2 then no further modification is
required.

(iv) If i < L/2 then let
$F_1 = \sum_{j=1}^{i} p_j / i$ and $F_2 = \sum_{j=i+1}^{L} p_j / (L-i)$.

(v) $MC'_{L/2} = MC_{L/2} \times \left(1 + \frac{L-i}{i} \cdot \frac{F_2}{F_1}\right)$,
where $MC'_{L/2}$ is the difference between the new modified
C-function and the C-function. This step 'shifts' the
modified C-function value (as shown in diagram 3.1) towards
the actual value.

The modification is based on the assumption that the
attributes in D2 are also dependent to a certain
extent (i.e. we also have to consider the error made by the
remaining L-i attributes). Since the dependency of the
attributes in D2 is not as strong as that in D1,
$MC_{L/2}$ can be considered as an upper bound to the
error made by any i out of L attributes. Thus it is
reasonable to assume that the maximum amount of error made
by the remaining L-i attributes is $MC_{L/2} \times (L-i)/i$.

Furthermore the document frequencies of the attributes
also affect the modification. Suppose the average document
frequency of D1 is 50 and the error made by these attributes
is 20, then we expect that the error made by D2 with an
average document frequency of 10 is less than 20. In fact
this thesis assumes that the error made by D2 is
proportional to the document frequencies. Thus the new
modified C-function value becomes

(modified C-function value) + $MC_{L/2} \times \frac{L-i}{i} \times \frac{F_2}{F_1}$

Therefore the difference between the new modified C-function
and the C-function is

$MC'_{L/2} = MC_{L/2} + MC_{L/2} \times \frac{L-i}{i} \times \frac{F_2}{F_1}$
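
A minimal sketch of this refinement step follows. The scaling factor used here, (L-i)/i multiplied by the ratio of the average document frequencies of D2 and D1, is a reading of the partly illegible formula that agrees with the surrounding explanation; it is an assumption, and the function name is hypothetical.

```python
def refined_difference(mc, freqs, i):
    """Return the refined difference between the modified C-function
    and the C-function at k > L/2.

    mc    -- current difference (the error attributed to D1)
    freqs -- document frequencies (p1, ..., pL), with the i dependent
             attributes D1 listed first
    i     -- number of dependent attributes
    """
    length = len(freqs)
    if i == 0 or 2 * i >= length:
        # Step (iii): no further modification when i >= L/2
        # (and nothing to scale when there are no dependent attributes).
        return mc
    f1 = sum(freqs[:i]) / i                 # average frequency of D1
    f2 = sum(freqs[i:]) / (length - i)      # average frequency of D2
    if f1 == 0:
        return mc
    # Assumed form: upper bound mc * (L - i) / i, scaled by f2 / f1.
    return mc * (1 + (length - i) / i * f2 / f1)


# Example in the spirit of the text: D1 averages 50, D2 averages 10.
shifted = refined_difference(20.0, [50, 50, 10, 10, 10, 10], 2)
```

With L = 6 and i = 2, the extra shift is 20 * (4/2) * (10/50) = 8, so the refined difference is 28.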
The refined modification procedure was then applied to set B
queries. The results, as shown in TABLE 3, indicate an
improvement in all cases. The average error is about 6%. For
the group of queries of length 10 - 15, significant
improvement is obtained.
TABLE 3
Experimental results by using set B queries only.
The C-function was modified by using the
dependent term set D and the refined modification
procedure.

k = the minimal number of attributes which a
    record possesses in common with a query


length of queries = 3 - 5

no. of queries    ave err
      19            4.94
      16            2.34
      10           15.00
       5            0.00


length of queries = 6 - 7

no. of queries    ave err
      17            3.45
      17            6.36
      17            (illegible)
      10           10.00
       7            0.00


length of queries = 8 - 9

no. of queries    ave err
      22            4.07
      22            (illegible)
      22            6.98
CHAPTER 4
TIME BOUND ANALYSIS OF ALGORITHMS


4.1 INTRODUCTION

In order to choose an appropriate algorithm from a set
of potential algorithms which can be used as a solution to a
problem, it is desirable to evaluate the cost of each. The
most common criteria used to evaluate algorithms are the
time and space consumed.

Essentially, these two criteria are measured in terms
of the 'size' of a problem, which can be defined as a measure
of the input quantity[9]. For example, in adding two
matrices, the size of the problem is the dimension of the
matrices; in computing the C-function, the size of the
problem is the length of the query.

The computing time consumed by an algorithm can
generally be expressed as a function of the size of the
problem. The time complexity of an algorithm is said to be
of order f(n) if there exists a constant c such that the
number of steps executed by the algorithm is always less
than or equal to cf(n), where n is the size of the problem.
For example, an algorithm which takes 2n^3 operations
has time complexity of order n^3 (denoted as O(n^3)). A
similar definition can be applied to the space consumed.


Furthermore, according to their time complexity,
algorithms can roughly be classified into two groups:
deterministic polynomial and exponential.

Definition. An algorithm is said to be deterministic
polynomial if there exists a polynomial p(n), such that the
number of steps executed by the algorithm is less than or
equal to p(n), where n is the size of the problem.

Definition. An algorithm is said to be exponential if it
runs in exponential time, i.e. the algorithm is of order
k^n, where k > 1 is a constant and n is the size of the
problem.

Exponential algorithms are usually applied to problems
of smaller size, since the amount of running time required
by an exponential algorithm for a large size problem will be
too large for the algorithm to be feasible.

On the other hand, deterministic polynomial algorithms
of low degrees are generally desirable for most
applications. The algorithms used in computing the
C-function, for example, are deterministic polynomial of
degree 2 (as will be shown in section 4.3).

The analysis of the time bounds required by some algorithms
used in this thesis is described in the following sections.
Let the given query be Q=(q1,...,qL) and its relative
frequencies of occurrence with respect to a cluster C of M
records be F=(f1,...,fL).

Recall that the C-function is defined to be the probability
that a record in C contains exactly i out of L attributes in Q.

From equation 2.1, the C-function can be expressed as follows:
Since this algorithm runs in exponential
time, it cannot be applied in general (especially in the
case where the query is long). Thus alternate
algorithms are developed to compute the C-function.

The time bounds required by these algorithms are
described in the next section.
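
The contrast between the two approaches can be illustrated with a small sketch. It assumes the independence model, under which the probability of a particular set of attributes being present is the product of f_j for the present attributes and (1 - f_j) for the absent ones; the direct enumeration below is one plausible reading of the exponential algorithm, not necessarily the form of equation 2.1, and the function names are hypothetical.

```python
from itertools import combinations
from math import prod


def c_function_bruteforce(freqs):
    """Enumerate every subset of attributes: exponential in len(freqs)."""
    length = len(freqs)
    result = []
    for i in range(length + 1):
        total = 0.0
        for present in combinations(range(length), i):
            chosen = set(present)
            total += prod(freqs[j] if j in chosen else 1.0 - freqs[j]
                          for j in range(length))
        result.append(total)
    return result


def c_function_product(freqs):
    """Multiply the factors (1 - f) + f*x one at a time: O(L^2)."""
    coeffs = [1.0]
    for f in freqs:
        nxt = [0.0] * (len(coeffs) + 1)
        for degree, c in enumerate(coeffs):
            nxt[degree] += c * (1.0 - f)      # attribute absent
            nxt[degree + 1] += c * f          # attribute present
        coeffs = nxt
    return coeffs


slow = c_function_bruteforce([0.5, 0.2, 0.1])
fast = c_function_product([0.5, 0.2, 0.1])
```

Both functions return the same coefficients; only the amount of work differs as the query grows longer.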
4.3 TIME BOUND FOR POLYNOMIAL MULTIPLICATIONS

According to Lemma 2.1, the problem of computing the
C-function is equivalent to obtaining the product of L
polynomials of degree 1, where L is the length of a given
query. There are many algorithms for solving this polynomial
multiplication problem and the fastest known algorithm is
called the Fast Fourier Transform[9]. The time bound
required by this algorithm is O(L log L) with a reasonably
large constant factor. As will be demonstrated later in this
section, the usual way of multiplying L polynomials together
takes O(L^2) operations with a small constant. Thus, the Fast
Fourier Transform is preferable to the standard method only
if L is sufficiently large. However, the number of
attributes in a query is usually not large enough to warrant
the Fast Fourier Transform method. Two other algorithms with
a slightly higher time bound are considered instead. The
time bounds of these two algorithms are outlined as follows:

i) Iterative multiplication of L polynomials of degree 1.
Let y = (a1x+b1) * (a2x+b2) * ... * (aLx+bL).
The numbers of multiplications and additions required in
multiplying the first two polynomials are respectively 2 * 2
= 4 and 1.
The numbers of multiplications and additions required in
multiplying the product of the first two polynomials by the
third polynomial are 2 * 3 = 6 and 2 respectively.

In general the total number of operations needed is

$(2 \cdot 2 + 1) + (2 \cdot 3 + 2) + \cdots + (2L + (L-1)) = (3L^2 + L - 4)/2$

thus the time bound required by this algorithm is O(L^2).
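
The operation count can be checked with a short sketch that multiplies the factors one at a time and counts the multiplications and additions actually performed. The function name and the representation of a factor a*x + b as the pair (a, b) are choices of this illustration.

```python
def iterative_product(factors):
    """Multiply (a1*x + b1)(a2*x + b2)... one factor at a time.

    factors -- list of (a, b) pairs, each standing for a*x + b.
    Returns (coefficients lowest degree first, multiplications, additions).
    """
    a, b = factors[0]
    coeffs = [b, a]
    mults = adds = 0
    for a, b in factors[1:]:
        nxt = [0.0] * (len(coeffs) + 1)
        filled = [False] * len(nxt)
        for degree, c in enumerate(coeffs):
            for shift, weight in ((0, b), (1, a)):
                term = c * weight
                mults += 1
                if filled[degree + shift]:
                    # A second term of the same degree costs one addition.
                    nxt[degree + shift] += term
                    adds += 1
                else:
                    nxt[degree + shift] = term
                    filled[degree + shift] = True
        coeffs = nxt
    return coeffs, mults, adds


coeffs, mults, adds = iterative_product([(1, 1)] * 4)
```

For L = 4 the counts are 4 + 6 + 8 = 18 multiplications and 1 + 2 + 3 = 6 additions, 24 operations in all, which grows quadratically with L.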


ii) Group multiplication of L polynomials of degree 1.
Assume the initial number of polynomials is L = 2^k,
where k is an integer, and each polynomial is of degree 1.
We can form a binary tree by using these L polynomials such
that the nodes of the lowest level of the tree represent
the polynomials of degree 1. The parent of any two nodes
represents the product of the two polynomials. The root of
the tree is then the final product of all the terminal
nodes.
A binary tree representing the multiplication of 8
polynomials of degree 1 is described below:

Level 3    1 polynomial of degree 8 (the final product)
Level 2    2 polynomials of degree 4
Level 1    4 polynomials of degree 2
Level 0    8 polynomials of degree 1

The number of additions needed to multiply two polynomials 
Thus the time bound required by the algorithm is O(L^2).

Since algorithm (ii) has a smaller constant factor, it is
used in this thesis for computing the C-function.
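
A minimal sketch of the group multiplication is given here. It multiplies neighbouring polynomials level by level, as in the binary tree; carrying an unpaired polynomial up to the next level when the count is odd is an addition of this sketch, since the text assumes L is a power of 2.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def group_product(polys):
    """Multiply polynomials pairwise, level by level, up to the root."""
    if not polys:
        return [1.0]
    level = [list(p) for p in polys]
    while len(level) > 1:
        nxt = [poly_mul(level[j], level[j + 1])
               for j in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            # Odd count: the last polynomial moves up unchanged.
            nxt.append(level[-1])
        level = nxt
    return level[0]


# Eight factors (1 + x): three levels of pairwise products.
root = group_product([[1.0, 1.0]] * 8)
```

The root of this example is (1 + x)^8, whose coefficients are the binomial coefficients 1, 8, 28, 56, 70, 56, 28, 8, 1.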
APPENDIX 1

Definition: $\phi$ is defined to be the Pearson product-moment
coefficient calculated on nominal-dichotomous data. Its main
function is to measure the correlation between random
variables, i.e. the higher the $\phi$ value between 2 variables,
the higher is the correlation between them.

Gene V. Glass and Julian C. Stanley[12] state that:
if Z is the standard normal random variable with mean 0 and
variance 1 then

$\sqrt{n}\,\phi = Z \implies n\phi^2 = Z^2$

since $Z^2 = \chi^2_1$ (Chi-square with 1 degree of freedom)

=> the higher the chi-square value, the higher is the
correlation between 2 variables.


APPENDIX 2

Let x11 be the number of documents in the data base that
contain attributes X1 and X2.
Let x12 be the number of documents in the data base that
contain attribute X2 but not X1.
Let x21 be the number of documents in the data base that
contain attribute X1 but not X2.
Let x22 be the number of documents in the data base that
contain neither attribute X1 nor X2.

The following is a contingency table for attributes X1 and X2.
APPENDIX 3

Let Q = (q1,...,qL) be a given query and E be the error
tolerance, and let $V = (V_1, \ldots, V_{\binom{L}{2}})$,
$U = ((u_{11},u_{12}), (u_{21},u_{22}), \ldots, (u_{\binom{L}{2}1},u_{\binom{L}{2}2}))$ and
$GF = (F_1, \ldots, F_{\binom{L}{2}})$ such that
$V_1 \ge V_2 \ge \cdots \ge V_{\binom{L}{2}}$, where $V_i$ and $F_i$
are respectively the chi-square value (obtained from the
contingency table) and the generating function of the
attribute pair $(u_{i1},u_{i2})$, $1 \le i \le \binom{L}{2}$, with respect to a
cluster.

The procedure in carrying out ALGORITHM 3.1 is outlined as
follows:

1) If the function error of the generating function of the
C-function of Q is less than E, then all the terms in Q are
independent and the algorithm stops.

2) Set i = 1.

3) Let the generating function of the C-function for Q be

$F = \left(\prod_{k=1}^{i} F_k\right) \times Fr$

where Fr is the generating function of the C-function of the
set $Q - \bigcup_{j=1}^{i} (u_{j1},u_{j2})$. Note that if F = F1 * F2 * Fr and
(u11,u12) ∩ (u21,u22) is not empty, and suppose u12 = u21,
then (u11,u12,u22) are considered as 3 dependent attributes
and F = F1 * Fr, where F1 is the actual co-occurrence
function of (u11,u12,u22) and Fr is the generating function
of the C-function for the set (Q - (u11,u12,u22)).

4) Obtain the function error of F. If it is less than E then
the dependent pairs are (uj1,uj2), j=1,...,i, and the algorithm
stops; otherwise let i = i + 1 and go to step 3.
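
The steps above can be sketched as follows. Several points are assumptions of this illustration and not the thesis's definitions: documents are sets of attributes, the function error is taken as the sum of absolute coefficient differences against the actual co-occurrence polynomial of the whole query (the thesis's own error formula is not legible), overlapping pairs are merged into one dependent group, and all function names are hypothetical.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def cooccurrence_poly(docs, attrs):
    """a_i = fraction of documents holding exactly i attributes of attrs."""
    counts = [0] * (len(attrs) + 1)
    for doc in docs:
        counts[len(doc & attrs)] += 1
    return [c / len(docs) for c in counts]


def merge_groups(pairs):
    """Merge overlapping pairs, e.g. (u11,u12) and (u12,u22), into one group."""
    groups = []
    for pair in pairs:
        merged, rest = set(pair), []
        for group in groups:
            if group & merged:
                merged |= group
            else:
                rest.append(group)
        groups = rest + [merged]
    return groups


def build_poly(docs, query, pairs):
    """F = (actual polynomials of the dependent groups) * Fr."""
    poly, covered = [1.0], set()
    for group in merge_groups(pairs):
        poly = poly_mul(poly, cooccurrence_poly(docs, group))
        covered |= group
    for attr in sorted(query - covered):
        f = sum(1 for d in docs if attr in d) / len(docs)
        poly = poly_mul(poly, [1.0 - f, f])   # independent factor (assumed form)
    return poly


def function_error(poly, actual):
    """Assumed error measure: sum of absolute coefficient differences."""
    return sum(abs(a - b) for a, b in zip(poly, actual))


def find_dependent_pairs(docs, query, ranked_pairs, tolerance):
    """Add ranked pairs one at a time until the error drops below tolerance."""
    actual = cooccurrence_poly(docs, query)
    for i in range(len(ranked_pairs) + 1):
        chosen = ranked_pairs[:i]
        if function_error(build_poly(docs, query, chosen), actual) < tolerance:
            return chosen
    return list(ranked_pairs)


docs = [{"a", "b", "c"}, {"a", "b"}, {"c"}, set()]
ranked = [("a", "b"), ("a", "c"), ("b", "c")]
found = find_dependent_pairs(docs, {"a", "b", "c"}, ranked, 0.01)
```

In this toy cluster the procedure stops after the first pair, since treating "a" and "b" as dependent already reproduces the actual distribution.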